Adaptive Lambda Least-Squares Temporal Difference Learning

Authors

  • Timothy Arthur Mann
  • Hugo Penedones
  • Shie Mannor
  • Todd Hester
Abstract

Temporal Difference learning, or TD(λ), is a fundamental algorithm in the field of reinforcement learning. However, setting TD's λ parameter, which controls the timescale of TD updates, is generally left up to the practitioner. We formalize the λ selection problem as a bias-variance trade-off where the solution is the value of λ that leads to the smallest Mean Squared Value Error (MSVE). To solve this trade-off we suggest applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of λ values. Unfortunately, this approach is too computationally expensive for most practical applications. For Least Squares TD (LSTD) we show that LOTO-CV can be implemented efficiently to automatically tune λ, and we apply function optimization methods to efficiently search the space of λ values. The resulting algorithm, ALLSTD, is parameter-free, and our experiments demonstrate that ALLSTD is significantly faster than the naïve LOTO-CV implementation while achieving similar performance.

The problem of policy evaluation is important in industrial applications where accurately measuring the performance of an existing production system can lead to large gains (e.g., recommender systems (Shani and Gunawardana, 2011)). Temporal Difference learning, or TD(λ), is a fundamental policy evaluation algorithm derived in the context of Reinforcement Learning (RL). Variants of TD are used in SARSA (Sutton and Barto, 1998), LSPI (Lagoudakis and Parr, 2003), DQN (Mnih et al., 2015), and many other popular RL algorithms. The TD(λ) algorithm estimates the value function for a policy and is parameterized by λ ∈ [0, 1], which averages estimates of the value function over future timesteps. The choice of λ induces a bias-variance trade-off. Even though tuning λ can have a significant impact on performance, previous work has generally left the problem of tuning λ up to the practitioner (with the notable exception of (White and White, 2016)).
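To make the timescale that λ controls concrete, the λ-return can be written as a recursive interpolation between the one-step bootstrapped TD target (λ = 0) and the full Monte Carlo return (λ = 1). The sketch below is our own illustration, not code from the paper; the trajectory representation and function name are assumptions:

```python
import numpy as np

def lambda_return(rewards, values, gamma, lam):
    """Lambda-returns for one trajectory (illustrative sketch).

    rewards[t] is r_t for t = 0..T-1; values[t] is the bootstrapped
    estimate V(s_{t+1}).  lam = 0 yields the one-step TD target,
    lam = 1 yields the Monte Carlo return (bootstrapping only at the end).
    """
    T = len(rewards)
    G = np.zeros(T)
    # Base case: last step bootstraps from V(s_T).
    G[T - 1] = rewards[T - 1] + gamma * values[T - 1]
    # Recursion: G_t = r_t + gamma * ((1-lam) * V(s_{t+1}) + lam * G_{t+1}).
    for t in range(T - 2, -1, -1):
        G[t] = rewards[t] + gamma * ((1 - lam) * values[t] + lam * G[t + 1])
    return G
```

Small λ trusts the (possibly biased) value estimates; large λ relies on the (higher-variance) sampled returns, which is exactly the trade-off the λ selection problem resolves.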
In this paper, we consider the problem of automatically tuning λ in a data-driven way. Defining the Problem: The first step is defining what we mean by the "best" choice of λ. We take the λ value that minimizes the MSVE as the solution to the bias-variance trade-off. Proposed Solution: An intuitive approach is to estimate the MSVE for a finite set Λ ⊂ [0, 1] and choose the λ ∈ Λ that minimizes this estimate. Score Values in Λ: We could estimate the MSVE with the loss on the training set, but these scores can be misleading due to overfitting. An alternative approach is to estimate the MSVE for each λ ∈ Λ via Cross-Validation (CV). In particular, in the supervised learning setting Leave-One-Out (LOO) CV gives an almost unbiased estimate of the loss (Sugiyama et al., 2007). We develop Leave-One-Trajectory-Out (LOTO) CV, but unfortunately LOTO-CV is too computationally expensive for many practical applications. Efficient Cross-Validation: We show how LOTO-CV can be implemented efficiently under the framework of Least Squares TD (LSTD(λ) and Recursive LSTD(λ)). Combining these ideas, we propose Adaptive λ Least-Squares Temporal Difference learning (ALLSTD). While a naïve implementation of LOTO-CV requires O(kn) evaluations of LSTD, ALLSTD requires only O(k) evaluations, where n is the number of trajectories and k = |Λ|. Our experiments demonstrate that the proposed algorithm is effective at selecting λ to minimize the MSVE, and that it is significantly faster than a naïve implementation. Contributions: The main contributions of this work are: 1. Formalize the λ selection problem as finding the λ value that leads to the smallest Mean Squared Value Error (MSVE), 2. Develop LOTO-CV and propose using it to search the space of λ values, 3. Show how LOTO-CV can be implemented efficiently for LSTD, 4. Introduce ALLSTD, which is significantly faster than the naïve LOTO-CV implementation, and 5.
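The naïve baseline that ALLSTD improves upon can be sketched as follows: for each λ ∈ Λ, refit LSTD(λ) n times, each time holding out one trajectory, which is the O(kn) cost the paper cites. This is our own illustrative sketch, not the paper's efficient implementation; the trajectory encoding as (φ(s), r, φ(s′)) triples and the held-out one-step Bellman error as the CV score are assumptions on our part:

```python
import numpy as np

def lstd_weights(trajs, gamma, lam, d, reg=1e-6):
    """Batch LSTD(lambda): accumulate A and b over trajectories, solve A w = b.

    Each trajectory is a list of (phi, r, phi_next) triples, with phi the
    d-dimensional feature vector of the current state.  `reg` adds a small
    ridge term so A is invertible on little data.
    """
    A = reg * np.eye(d)
    b = np.zeros(d)
    for traj in trajs:
        z = np.zeros(d)  # eligibility trace, reset per trajectory
        for phi, r, phi_next in traj:
            z = gamma * lam * z + phi
            A += np.outer(z, phi - gamma * phi_next)
            b += z * r
    return np.linalg.solve(A, b)

def loto_cv_select(trajs, gamma, lambdas, d):
    """Naive LOTO-CV: O(|lambdas| * n) full LSTD fits.

    Scores each lambda by squared one-step Bellman error on the held-out
    trajectory (a proxy for MSVE) and returns the best-scoring lambda.
    """
    best_lam, best_score = None, np.inf
    for lam in lambdas:
        score = 0.0
        for i in range(len(trajs)):
            train = trajs[:i] + trajs[i + 1:]          # leave trajectory i out
            w = lstd_weights(train, gamma, lam, d)
            for phi, r, phi_next in trajs[i]:
                score += float((phi @ w - (r + gamma * (phi_next @ w))) ** 2)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
```

ALLSTD's contribution is precisely to avoid the inner refitting loop, reusing shared sufficient statistics so that each λ costs only one solve.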
Prove that ALLSTD converges to the optimal hypothesis.

Background

Let M = 〈S, A, P, r, γ〉 be a Markov Decision Process (MDP) where S is a countable set of states, A is a finite set of actions, P(s′|s, a) maps each state-action pair (s, a) ∈ S × A to the probability of transitioning to s′ ∈ S in a single timestep, r is an |S|-dimensional vector mapping each state s ∈ S to a scalar reward, and γ ∈ [0, 1] is the discount factor. We assume we are given a function φ : S → R^d that maps each state to a d-dimensional vector, and we denote by X = φ(S) a d × |S| matrix with one column for each state s ∈ S. Let π be a stochastic policy and denote by π(a|s) the probability that the policy executes action a ∈ A from state s ∈ S. Given a policy π, we can define the value function
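The page truncates the text at this point. The definition the sentence leads into is presumably the standard discounted value function; reconstructed here for the reader (our reconstruction, not the paper's exact display):

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t) \;\middle|\; s_0 = s \right],
\qquad
v_{\pi} \;=\; \sum_{t=0}^{\infty} (\gamma P^{\pi})^{t} r \;=\; (I - \gamma P^{\pi})^{-1} r,
```

where P^π(s′|s) = Σ_{a ∈ A} π(a|s) P(s′|s, a) is the state-transition matrix induced by π.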


Similar articles

Locally Weighted Least Squares Temporal Difference Learning

This paper introduces locally weighted temporal difference learning for evaluation of a class of policies whose value function is nonlinear in the state. Least squares temporal difference learning is used for training local models according to a distance metric in state-space. Empirical evaluations are reported demonstrating learning performance on a number of strongly non-linear value function...


Sustainable ℓ2-regularized actor-critic based on recursive least-squares temporal difference learning

Least-squares temporal difference learning (LSTD) has been used mainly for improving the data efficiency of the critic in actor-critic (AC). However, convergence analysis of the resulting algorithms is difficult when the policy is changing. In this paper, a new AC method is proposed based on LSTD under the discount criterion. The method comprises two components as the contribution: (1) LSTD works in an ...


Ensembles of extreme learning machine networks for value prediction

Value prediction is an important subproblem of several reinforcement learning (RL) algorithms. In a previous work, it has been shown that the combination of least-squares temporal-difference learning with ELM (extreme learning machine) networks is a powerful method for value prediction in continuous-state problems. This work proposes the use of ensembles to improve the approximation capabilitie...


Least-squares temporal difference learning based on extreme learning machine

This paper proposes a least-squares temporal difference (LSTD) algorithm based on extreme learning machine that uses a single-hidden-layer feedforward network to approximate the value function. While LSTD is typically combined with local function approximators, the proposed approach uses a global approximator that allows better scalability properties. The results of the experiments carried out o...


Kernel Recursive Least-Squares Temporal Difference Algorithms with Sparsification and Regularization

By combining with sparse kernel methods, least-squares temporal difference (LSTD) algorithms can construct the feature dictionary automatically and obtain a better generalization ability. However, the previous kernel-based LSTD algorithms do not consider regularization and their sparsification processes are batch or offline, which hinder their widespread applications in online learning problems...


Incremental Least-Squares Temporal Difference Learning

Approximate policy evaluation with linear function approximation is a commonly arising problem in reinforcement learning, usually solved using temporal difference (TD) algorithms. In this paper we introduce a new variant of linear TD learning, called incremental least-squares TD learning, or iLSTD. This method is more data efficient than conventional TD algorithms such as TD(0) and is more comp...



Journal:
  • CoRR

Volume: abs/1612.09465  Issue: -

Pages: -

Publication date: 2016